ABSTRACT

Action recognition is an important problem in multimedia
understanding. This paper addresses this problem by building an
expressive compositional action model. We model one action instance
in the video with an ensemble of spatio-temporal compositions: a
number of discrete temporal anchor frames, each of which is further
decomposed into a layout of deformable parts. In this way, our model
can identify a Spatio-Temporal And-Or Graph (STAOG) to represent
the latent structure of actions, e.g. triple jumping, swinging and
high jumping. The STAOG model comprises four layers: (i) a batch
of leaf-nodes at the bottom for detecting various action parts within
video patches; (ii) the or-nodes over the bottom, i.e. switch variables
that activate their children leaf-nodes for structural variability; (iii)
the and-nodes within an anchor frame for verifying spatial
composition; and (iv) the root-node at the top for aggregating scores
over temporal anchor frames. Moreover, contextual interactions are
defined between leaf-nodes in both spatial and temporal domains.

For model training, we develop a novel weakly supervised learning
algorithm which iteratively determines the structural configuration
(e.g. the production of leaf-nodes associated with the or-nodes)
along with the optimization of multi-layer parameters. By fully
exploiting spatio-temporal compositions and interactions, our
approach handles large intra-class action variance well (e.g. different
views, individual appearances, spatio-temporal structures). The
experimental results on challenging databases demonstrate superior
performance of our approach over other competing methods.

Categories and Subject Descriptors 

I.5 [Computing Methodologies]: Pattern Recognition; I.4 [Computing

1. INTRODUCTION

With the popularity of personal video cameras and multi-view
video capturing devices, we are entering an era in which a rich amount
of multimedia documents surrounds us. To interact with these
videos, there is an increasing demand for understanding the human
activities in them. Although many research studies [45]
[41][31][46][22][5][29][37][32][8] have been carried out to understand
and retrieve large-scale video content, these works focus on
high-level semantics instead of describing human activities. There still
exists a need to recognize fine-grained information about human
activities, e.g. body poses and temporal motions.

This paper targets the challenge of understanding spatial and
temporal variances of video phenomena. More specifically, we
consider the following difficulties:

• A human body is composed of multiple parts, and different 
parts are associated with different motions. 

• Both short-term and long-term motions may be mixed with
background movement or camera motion, which makes it
difficult to model the temporal characteristics of actions
accurately.

• In real-world videos, human actions often happen with
uncertainties: some arise from occlusions, some are caused by
view/pose variance, and some are due to diverse actor
appearances and motions.

Due to these difficulties, it is crucial to reduce the spatio-temporal
ambiguity when modeling human actions. Most previous
studies were built on simplified action models and overlooked
the detailed spatio-temporal structure information. Of these works,
a large number were based on spatio-temporal interest
points [16][42][33]. Some researchers proposed to enrich the action
model with appearance or context information [40]
[34]. Other researchers learned temporal structures for action
recognition [28][35]. However, none of these works provides an
effective model which can unify spatial and temporal information
to infer the structure of human motion.

We aim to develop an effective configurable model, namely the
Spatio-Temporal And-Or Graph (STAOG), for action recognition,
which addresses the problems mentioned above. Our idea is
partially motivated by the image grammar model [47], which
hierarchically decomposes an image pattern with mixed and-nodes and
or-nodes and models rich structural variations of parts.

The challenges of generalizing And-Or graphs for action
recognition are twofold. First, the traditional And-Or graphs are
limited in modeling the hierarchical configuration of spatio-temporal
information. Because actions in videos are often more complicated
than images, we need more powerful models for video problems.
Second, videos require more efficient models that can be effectively
learned from a large amount of video information without elaborate
supervision and initialization.

[Figure 1 label: Layer 1, Action: Triple Jump]

Figure 1: An example of the Spatio-Temporal And-Or Graph
model, where the and-nodes represent compositions in either
time or space, the or-nodes indicate structural alternatives, and
the leaf-nodes (at the bottom) correspond to local part
detectors. The links between leaf-nodes represent spatio-temporal
contextual interactions.

To handle the first challenge, our STAOG model extends the
traditional deformable graphical models by introducing switch
variables in the hierarchy, i.e. or-nodes that explicitly specify structural
reconfiguration. Both spatial and temporal interactions between
action parts can be learned simultaneously. One action in the video
can be treated as an ensemble of spatio-temporal compositions: a
number of discrete temporal anchor frames, each of which is
further decomposed into a layout of deformable parts. An example of
the proposed STAOG model is illustrated in Fig. 1. There are four
layers in our model from bottom to top:

(1) The leaf-nodes in the bottom layer represent a batch of local
classifiers for detecting various action parts in every anchor frame,
denoted by the solid circles in Fig. 1. During detection, location
displacement is allowed for each leaf-node to tackle part
deformation.

(2) The or-nodes over the bottom are “switch” variables
specifying the activation of their children leaf-nodes, denoted by the
dashed circles. Each or-node is used to specify an appropriate
selection from the candidate action parts detected by its associated
children leaf-nodes. In this way, by explicitly switching selections over
leaf-nodes, the or-nodes make our model reconfigurable during
inference, which is the key to handling large action
variabilities.

(3) The and-nodes verify the holistic appearance of the action within
an anchor frame (the rectangles in layer 2); we therefore regard each
of them as a spatial and-node. Its response includes two terms: (i) a
global classifier with bag-of-features, and (ii) aggregated scores from
its children or-nodes.

(4) The root-node at the top can be viewed as an and-node in
time (the rectangle at the top). Its definition is similar to that of the
spatial and-node: (i) a classifier with global features in the observed
frames, and (ii) aggregated scores over candidate temporal anchor
frames, plus a penalty for anchor frame displacements.

(5) The spatio-temporal contextual interactions, e.g. the curves
(graph edges) among leaf-nodes in Fig. 1, are defined based on
informative pairwise contextual relations in either the spatial or the
temporal domain. Note that these collaborative edges are imposed between
leaf-nodes that are associated with different action parts. Their
effectiveness is demonstrated in the experiments.
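
The layered structure listed above can be pictured as a small nested data structure. The following is only an illustrative sketch, not the authors' implementation: the class names, the fields, and the helper `build_staog` are hypothetical, and parameters, features and edges are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LeafNode:
    part_id: int                 # action part this local detector covers
    init_pos: Tuple[int, int]    # initial position of the part (assumed 2-D)


@dataclass
class OrNode:
    leaves: List[LeafNode] = field(default_factory=list)
    active: int = 0              # index of the single activated child leaf

    def selected(self) -> LeafNode:
        return self.leaves[self.active]


@dataclass
class AndNode:
    anchor_time: int             # index of the temporal anchor frame
    or_nodes: List[OrNode] = field(default_factory=list)


@dataclass
class RootNode:
    and_nodes: List[AndNode] = field(default_factory=list)


def build_staog(num_anchors: int, parts_per_frame: int, leaves_per_part: int) -> RootNode:
    """Build an empty four-layer hierarchy: root -> and-nodes -> or-nodes -> leaf-nodes."""
    root = RootNode()
    for t in range(num_anchors):
        and_node = AndNode(anchor_time=t)
        for i in range(parts_per_frame):
            leaves = [LeafNode(part_id=i, init_pos=(0, 0)) for _ in range(leaves_per_part)]
            and_node.or_nodes.append(OrNode(leaves=leaves))
        root.and_nodes.append(and_node)
    return root
```

Switching `active` on an or-node changes which leaf-node represents that part, which is the reconfiguration mechanism described in item (2).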

To overcome the second challenge, we present a novel weakly
supervised learning algorithm for model training, inspired by
non-convex optimization techniques [43][38]. This algorithm trains
the model in a dynamic manner: the model structure (e.g. the
configuration of leaf-nodes and or-nodes) is iteratively generated and
reconfigured on the training data while the multi-layer
parameters are optimized. The other structure attributes (e.g. the
activation of leaf-nodes and the temporal deformation of anchor frames)
are modeled with latent variables and optimized simultaneously.

In the testing stage, we present an algorithm of cascaded search
and verification for recognizing actions with the trained STAOG
model. We first generate a set of hypotheses in both spatial and
temporal compositions. (i) Spatial testing via the and-nodes. Within
the input frame, all candidate action parts are found by the leaf-nodes,
and several possible configurations (i.e. spatial compositions of
action parts) are produced with different specifications via the
or-nodes. These configurations are also weighted via the and-node.
(ii) Temporal testing via the root-node. The scores proposed by the
and-nodes are aggregated via the root-node for a possible temporal
composition. Several possible configurations are then produced as
hypotheses represented by different latent variables. Finally, each
hypothesis is globally verified with the spatial and temporal edges
in the model.

This paper is organized as follows. Section 2 gives a brief review
of related work. We then present our STAOG model in Section 3,
followed by a description of the inference procedure in Section 4.
Section 5 describes the structural learning of our model. The
experimental results and comparisons are presented in
Section 6. Section 7 concludes this paper.

2. RELATED WORK 

Traditional works for action recognition focused on developing
informative features, such as spatio-temporal descriptors [16][7]
[46], 3D Gradient [15], 3D SIFT descriptors [29] and motion
features [14][18], with the action classifier trained on labeled
data. Most of these methods, however, are limited to periodic
actions with clean background, such as running and jogging.

To address complex actions with cluttered background, several
compositional or expressive models were proposed and achieved
very impressive results [23][5][27][39][13][6]. For example, Wang
et al. [39] modeled the human action by a flexible constellation of
parts conditioned on image observations and learned the
parameters of an HCRF model in a max-margin framework, motivated by
the recent progress in object recognition and detection, e.g. the
deformable part model by Felzenszwalb et al. [10]. Yao et al. [40]
proposed to generate spatio-temporal action templates with the
information projection method. Sadanand et al. [27] adopted
high-level representations with a bank of individual action detectors.
However, actions in video often involve much more information in
both the spatial and temporal domains than image-based object
recognition, and most of these studies do not explicitly localize
parts of actions (actors) due to the computational burden.
Moreover, the structural configurations of these models are usually fixed,
including a fixed number of part detectors as well as a predefined
composition.

One unique characteristic of the human action recognition problem
lies in the temporal structure. Many works were proposed to
build temporal structure models [9][24][35][4] based on
discriminative and interesting motion segments of the video. Raptis et al. [25]
extracted clusters of trajectories and proposed a graphical model to
incorporate constraints for individual and group events. Albanese
et al. [1] represented temporal relations of activities using
probabilistic Petri Nets and integrated high-level reasoning approaches.
Different from these approaches, we do not treat whole temporal
frames as units. Instead, we model temporal structure based on
action parts with explicit relations and present a solution to find
both spatial and temporal configurations for dynamic activities.

Recently, the And-Or graph models [47] have been discussed
for several vision tasks such as object recognition [20] and shape
modeling [19]. These works mainly focused on images instead of
videos and do not take the temporal dynamic structure into account.
The very recent work by Amer et al. [3] proposed to recognize
activities with a spatio-temporal And-Or graph model, but they
over-simplified the model training by manually fixing the model
structure (i.e. the layout of graph nodes).

It is worth mentioning that this paper learns the spatio-temporal
graph without using any extra annotations or scripts. Research
works which utilize rich annotation for event parsing and
interpretation are beyond the scope of this work. In contrast, Marszalek
et al. [23] explored the action contexts of natural dynamic scenes
with movie scripts. Gupta et al. m proposed to learn a visually 
grounded storyline model from annotated videos, and Pei et al. [30]
studied the event grammar model for daily activities based on a
predefined set of unary and binary relations. Extra annotations are
required for these studies.

3. SPATIO-TEMPORAL AND-OR GRAPH 

The STAOG model is defined as $\mathcal{G} = (V, \mathcal{E})$, where $V$ represents
the four types of nodes and $\mathcal{E}$ the graph edges, as shown in Fig. 1. The
root-node at the top verifies the temporal composition, aggregating
scores over anchor frames. Each and-node represents a temporal
anchor frame and verifies its spatial composition. The or-nodes are
derived from each and-node and are “switch” variables
specifying the activation of their children leaf-nodes. The number of
leaf-nodes for each or-node is dynamically learned, with an upper
limit $m$. For simplicity, we use $t = 1, \ldots, T$ to index all
and-nodes in the whole STAOG model, $i = 1, \ldots, Z$ for or-nodes
and $j = 1, \ldots, n$ for leaf-nodes. We also index the child or-nodes
of and-node $A_t$ as $i \in ch(t)$, and the child leaf-nodes of
or-node $U_i$ as $j \in ch(i)$. The spatio-temporal graph edges (i.e.
interactions) are defined between leaf-nodes associated with
different or-nodes. In this section, we describe two factors in detail:
the spatio-temporal compositions, and the contextual interactions
in both spatial and temporal domains.

3.1 Spatio-Temporal Compositions 

We employ Laptev’s 3-D corner detector [16] to detect interest
points in video sequences, and each interest point is described
by HoG (histogram of gradient) and HoF (histogram of optical 
flow) Gl) Furthermore, we generate a dictionary of spatio-temporal 
interest points’ descriptors, clustered by the k-means method in
the training stage. Given a video sequence $X$, we first divide
it equally into $T$ temporal segments. The center frame of each
segment is chosen as an initial anchor frame. Each anchor frame is
further decomposed into a number of action parts. In our method,
we define the action parts based on the video patch representation,
i.e. 3-D volumes spanning $\rho$ consecutive frames. Thus, for each
anchor frame $I_t$, we observe a sequence of frames centered at $I_t$,
and this sequence, denoted as $\Lambda_t$, is treated as the input for anchor
frame processing.

Figure 2: Illustration of spatial compositions. (a) The black
boxes denote the initial positions of action parts. (b) The parts
associated with a set of leaf-nodes are shown.
Each $(d_x, d_y)$ indicates the location displacement determined
by the model. (c) The activated leaf-nodes are highlighted in
red, and spatial contextual interactions are defined between
pairs of spatially adjacent leaf-nodes.
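
The initialization of anchor frames described above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are indexed from 0, the function names are hypothetical, and the rounding rule for the segment center and the clipping at the video boundaries are choices of this sketch, not details given in the paper.

```python
def initial_anchor_frames(num_frames: int, num_segments: int) -> list:
    """Split the video into equal temporal segments and return each segment's center frame."""
    seg_len = num_frames / num_segments
    anchors = []
    for t in range(num_segments):
        center = int(t * seg_len + seg_len / 2)   # center of the t-th segment
        anchors.append(min(center, num_frames - 1))
    return anchors


def volume_range(center: int, span: int, num_frames: int) -> tuple:
    """Frame range [start, end) of the 3-D volume of `span` frames centered at an anchor."""
    start = max(0, center - span // 2)
    end = min(num_frames, start + span)
    return start, end
```

For a 100-frame video and four segments, the initial anchors fall in the middle of the ranges 0-24, 25-49, 50-74 and 75-99; inference later shifts each anchor by a learned temporal displacement.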

Leaf-node: Each leaf-node $L_j$ represents a local classifier for
detecting action parts within video patches and includes two terms:
an appearance feature $\phi^l$ and a spatial displacement feature $\phi^s$.
Within an anchor frame $I_t$, the features of action parts are described
by the BoW histogram based on the generated dictionary. Assume
the action part detected by $L_j$ is localized at position
$p_j = (p^x_j, p^y_j)$; then $\phi^l(\Lambda_t, p_j)$ denotes its appearance feature.
During detection, the locations of action parts are allowed to be perturbed
to tackle spatial deformation. We incorporate the spatial
displacement $\phi^s(q_j, p_j) = (d_x, d_y)$ for each action part, where $q_j$
is the initial position of $L_j$, set according to the center-point of the
frame, and $(d_x, d_y)$ is the displacement of $p_j$ from $q_j$, computed by
maximizing the response of $L_j$ during inference. The
response of $L_j$ is defined as,

$$R^l_j(\Lambda_t, p_j) = \omega^l_j \cdot \phi^l(\Lambda_t, p_j) - \omega^s_j \cdot \phi^s(q_j, p_j), \qquad (1)$$

where $\omega^l_j$ is the parameter for the appearance feature and $\omega^s_j$
is the spatial deformation parameter.
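
As a rough illustration of Eq. (1), the sketch below scores a few candidate placements of one part and keeps the best one. The function names and the candidate list are assumptions made for the example; the displacement feature is taken to be the raw offset $(d_x, d_y)$ as stated in the text, and feature extraction is not shown.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def leaf_response(w_app, feat_app, w_def, displacement):
    """Eq. (1): appearance score minus the weighted spatial displacement."""
    return dot(w_app, feat_app) - dot(w_def, displacement)


def best_placement(w_app, w_def, init_pos, candidates):
    """Pick the candidate (position, appearance feature) with the highest leaf response."""
    best = None
    for pos, feat in candidates:
        d = (pos[0] - init_pos[0], pos[1] - init_pos[1])   # (d_x, d_y)
        score = leaf_response(w_app, feat, w_def, d)
        if best is None or score > best[0]:
            best = (score, pos)
    return best
```

A stronger appearance match can win even when it lies away from the initial position, as long as the displacement term does not outweigh the gain.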

Or-node: Each or-node $U_i$ specifies an appropriate
candidate from its children leaf-nodes. For each leaf-node
$L_j$ among the children of $U_i$, the indicator variable $v_j \in \{0, 1\}$
represents whether it is activated or not, and each or-node selects only one
leaf-node. Briefly, we utilize the indicator vector $v_i$ for the or-node
$U_i$, and each element of $v_i$ is the indicator variable $v_j$ of a
leaf-node $L_j$. Intuitively, the significant intra-class variance caused by
views, background clutter or actors can be captured by different
spatial configurations that are determined by the or-nodes. The
response of the or-node $U_i$ is defined as,

$$R^u_i(\Lambda_t, v_i) = \sum_{j \in ch(i)} R^l_j(\Lambda_t, p_j) \cdot v_j. \qquad (2)$$
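
A minimal sketch of the or-node response in Eq. (2) follows, with a one-hot indicator vector over the children leaf-nodes. The helper names are hypothetical. Picking the highest-scoring leaf, as `best_indicator` does, is only a local choice made for this example; the inference procedure in the paper keeps several alternative selections as hypotheses and verifies them with the contextual edges.

```python
def or_node_response(leaf_scores, indicator):
    """Eq. (2): sum of leaf responses weighted by a one-hot indicator vector."""
    if any(v not in (0, 1) for v in indicator) or sum(indicator) != 1:
        raise ValueError("an or-node must activate exactly one leaf-node")
    return sum(r * v for r, v in zip(leaf_scores, indicator))


def best_indicator(leaf_scores):
    """One-hot vector selecting the leaf-node with the largest response."""
    k = max(range(len(leaf_scores)), key=leaf_scores.__getitem__)
    return [1 if j == k else 0 for j in range(len(leaf_scores))]
```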



























And-node: Each and-node $A_t$ verifies the holistic appearance
of the action in the anchor frame $I_t$ and the spatial composition of
its children or-nodes. We define the configuration vector $V_t$ for
all leaf-nodes within the anchor frame, which collects the indicator
vectors $v_i$ of its children or-nodes $U_i$. The response
of the and-node $A_t$ is defined as,

$$R^a_t(\Lambda_t, V_t) = \omega^a \cdot \phi^a(\Lambda_t) + \sum_{i \in ch(t)} R^u_i(\Lambda_t, v_i), \qquad (3)$$

where $\phi^a(\Lambda_t)$ is the BoW histogram globally extracted from the
3-D volume $\Lambda_t$ centered at $I_t$. The second term aggregates the
response scores from all children or-nodes of $A_t$.

Root-node: The root-node is a global potential function that
verifies the temporal compatibility of the model. It includes three terms:
the global BoW histogram of the video clip, the aggregated scores of its
children and-nodes, and the temporal displacements of anchor frames.

Fig. 4 illustrates the temporal composition by the root-node. We
employ the root-node to search for the best localizations of the $T$
anchor frames. We introduce the latent variable $\Delta_t$ to indicate the
temporal displacement of each anchor frame $I_t$, which is
calculated during inference. This implicitly carries the temporal
ordering constraints which are crucial for discriminating human
activities.

In particular, the temporal displacement penalty punishes the
position of the and-node $A_t$ (i.e. one anchor frame) for shifting far
away from the initial anchor point $\tau_t$ in time. Once $\Delta_t$ is optimized,
the position of each anchor frame is determined by $\tau_t + \Delta_t$
accordingly. We define the penalty $\xi_t$ by,

$$\xi_t = -\omega^\tau_t \cdot \Delta_t, \qquad (4)$$

where $\omega^\tau_t$ is the corresponding parameter. The response of the
root-node can then be defined as,

$$R^r(X, V, \Delta) = \omega^r \cdot \phi^r(X) + \sum_{t=1}^{T} \left( R^a_t(\Lambda_t, V_t) + \xi_t \right), \qquad (5)$$

where $\phi^r(X)$ is the BoW histogram feature extracted from the
whole video sequence $X$. $V = (V_1, \ldots, V_T)$ and $\Delta = (\Delta_1, \ldots, \Delta_T)$
are latent variables in the model specifying the spatial and
temporal configurations.

3.2 Contextual Interactions

Spatial interactions. We impose spatial contextual interactions,
i.e. spatial edges, between pairs of spatially adjacent leaf-nodes
in each anchor frame, as Fig. 2(c) illustrates. Note that we only
link edges between pairs of leaf-nodes that are associated
with two different or-nodes.

Figure 3: Illustration of the contextual relations for defining
spatial edges in the STAOG model. We define the edges between
spatially adjacent leaf-nodes with 8 relations according to their
spatial layout: above, below, left, right, near, far, clockwise and
anti-clockwise.

For one edge connecting two leaf-nodes $(L_j, L_{j'})$, we define
it with a set of informative relations, i.e. an 8-bin binary feature
$\varphi^s(L_j, L_{j'})$: above, below, left, right, near, far, clockwise and
anti-clockwise between two adjacent leaf-nodes. The relations are
visualized in Fig. 3. Suppose one edge connects two leaf-nodes
$(L_j, L_{j'})$ which detect action parts at positions $p_j$ and $p_{j'}$
respectively. The centered red rectangle represents the location $p_j$, and
the other red rectangles represent the adjacent parts. In the right
chart of Fig. 3, the dashed line represents the initial layout of the
two leaf-nodes, and the black solid line the adjusted actual layout
during inference. We then define the relations as follows:

• near or far: If $p_{j'}$ falls inside the outer dashed ellipse, it
is near to $p_j$, i.e. the bin of near is activated (set to 1);
otherwise it is far from $p_j$.

• above, below, left or right: The corresponding bin is set to
1 only if the center of $p_{j'}$ is inside the corresponding dashed
rectangle.

• clockwise or anti-clockwise: One of the two relations is
activated (set to 1) according to the angle between the dashed
line and the black solid line.

The relations intuitively encode the spatial contexts of two action
parts detected via two leaf-nodes belonging to two different
or-nodes. The response of the pairwise potentials can be
parameterized as,

$$r^s_{jj'} = \beta^s_{jj'} \cdot \varphi^s(L_j, L_{j'}), \qquad (6)$$

where $\beta^s_{jj'}$ is the corresponding 8-bin parameter vector.
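
To make the 8-bin spatial feature concrete, here is a hedged sketch. The paper defines the regions through the dashed ellipse and rectangles of Fig. 3, whose exact geometry is not given in the text, so the circular `near_radius`, the axis-aligned `band`, and all default values below are assumptions of this example. Positions are image coordinates with the y axis pointing down; `q_j` and `q_k` are the initial positions and `p_j` and `p_k` the detected positions of the two leaf-nodes.

```python
import math

RELATIONS = ("above", "below", "left", "right",
             "near", "far", "clockwise", "anti-clockwise")


def spatial_relation_feature(p_j, p_k, q_j, q_k, near_radius=40.0, band=20.0):
    """Return the 8-bin binary feature for the edge (L_j, L_k), ordered as RELATIONS."""
    dx, dy = p_k[0] - p_j[0], p_k[1] - p_j[1]
    feat = dict.fromkeys(RELATIONS, 0)

    # Directional bins: p_k must lie in a band around the corresponding axis.
    if abs(dx) <= band and dy < 0:
        feat["above"] = 1
    if abs(dx) <= band and dy > 0:
        feat["below"] = 1
    if abs(dy) <= band and dx < 0:
        feat["left"] = 1
    if abs(dy) <= band and dx > 0:
        feat["right"] = 1

    # Distance bins (a circle stands in for the ellipse of Fig. 3).
    if math.hypot(dx, dy) <= near_radius:
        feat["near"] = 1
    else:
        feat["far"] = 1

    # Rotation of the actual layout relative to the initial layout.
    ix, iy = q_k[0] - q_j[0], q_k[1] - q_j[1]
    cross = ix * dy - iy * dx
    if cross > 0:          # with y pointing down, positive cross is clockwise on screen
        feat["clockwise"] = 1
    elif cross < 0:
        feat["anti-clockwise"] = 1

    return [feat[r] for r in RELATIONS]
```

The edge response of Eq. (6) is then the inner product of this vector with the learned 8-bin parameter vector.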

Temporal interactions. We also impose edges in the temporal
domain to represent the temporal interactions of
action parts. The edges connect temporally adjacent leaf-nodes, as
illustrated in Fig. 4. They link any pair of
leaf-nodes $(L_j, L_{j'})$ that belong to the same part within two
adjacent anchor frames respectively. A set of temporal relations is
collected into a 4-bin binary feature vector $\varphi^\tau(L_j, L_{j'})$.

Specifically, we adopt four predicates: intersect, after, meets and
interrupt, inspired by Allen’s temporal predicates [2][26]. These
predicates describe relations between two time intervals. The
action part detected by one leaf-node $L_j$ for a specific anchor frame
is described by the feature $\phi^l$ extracted from a 3-D volume with
time span $\rho$. Note that we ignore some ordering predicates such
as before and equals, as the order of temporally adjacent anchor
frames is supposed to be fixed. Assume that leaf-node $L_j$ is
associated with the anchor frame localized at $\tau_t + \Delta_t$ (initial position
plus displacement). The starting and ending times $(t^{start}_{L_j}, t^{end}_{L_j})$ of
$L_j$ can be calculated as,

$$t^{start}_{L_j} = \tau_t + \Delta_t - \frac{\rho}{2}, \qquad t^{end}_{L_j} = \tau_t + \Delta_t + \frac{\rho}{2}. \qquad (7)$$

Then we can define the four temporal predicates for two
temporally adjacent leaf-nodes $(L_j, L_{j'})$ as,

$$\begin{aligned}
intersect(L_j, L_{j'}) &\iff t^{start}_{L_{j'}} < t^{end}_{L_j}, \\
after(L_j, L_{j'}) &\iff t^{end}_{L_j} < t^{start}_{L_{j'}} < t^{end}_{L_j} + \rho, \\
meets(L_j, L_{j'}) &\iff t^{start}_{L_{j'}} = t^{end}_{L_j}, \\
interrupt(L_j, L_{j'}) &\iff t^{end}_{L_j} + \rho < t^{start}_{L_{j'}}.
\end{aligned} \qquad (8)$$























Figure 4: Illustration of temporal compositions. The input
video is decomposed into a number of discrete temporal anchor
frames. The optimal position of each anchor frame is localized
at $\tau_t + \Delta_t$, i.e. the temporal displacement $\Delta_t$ plus the initial
anchor point $\tau_t$. The temporal contextual interactions are defined
between temporally adjacent action parts.


Thus, we define the response of one temporal edge linking two
leaf-nodes accordingly,

$$r^\tau_{jj'} = \beta^\tau_{jj'} \cdot \varphi^\tau(L_j, L_{j'}), \qquad (9)$$

where $\beta^\tau_{jj'}$ is the corresponding 4-bin parameter vector. If the pair of
leaf-nodes $(L_j, L_{j'})$ satisfies a specific predicate, the
corresponding bin is set to 1, otherwise 0.

Therefore, the overall response of the STAOG model is:

$$R^g(X, V, \Delta) = R^r(X, V, \Delta) + \sum_{j} \Big( \sum_{j' \in \gamma^s(j)} r^s_{jj'} \cdot v_j \cdot v_{j'} + \sum_{j' \in \gamma^\tau(j)} r^\tau_{jj'} \cdot v_j \cdot v_{j'} \Big), \qquad (10)$$

where $(V, \Delta)$ are the hidden variables in the STAOG model. The
second term defines the spatio-temporal contextual interactions
between the leaf-nodes. $\gamma^s(j)$ denotes the set of neighbors
that are spatially adjacent to the leaf-node $L_j$, and
$\gamma^\tau(j)$ the set of leaf-nodes that are temporally
adjacent to $L_j$. Intuitively, spatial interactions between
action parts guarantee spatial coherence, while temporal
interactions embed the temporal contextual relations. For brevity,
we refer to $\mathcal{L} = (V, \Delta)$ as the latent variables in the following.
Eq. 10 can then be written compactly as:

$$R^g(X, \mathcal{L}) = \psi \cdot \Phi(X, \mathcal{L}), \qquad (11)$$

where $\psi$ includes the complete parameters of the STAOG model,
and $\Phi(X, \mathcal{L})$ denotes the overall feature vector.
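
The way the pairwise terms enter the overall response can be illustrated with a short sketch. It assumes the root-node response has already been computed and that each edge is stored as a tuple (j, j', parameter vector, binary relation feature); this data layout and the function names are assumptions for the example only.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def global_response(root_score, edges, active):
    """Add the edge responses of all pairs of activated leaf-nodes to the root score.

    edges  : list of (j, k, beta, feature) tuples, one per term of the double sum
    active : set of leaf-node ids whose indicator variable is 1
    """
    total = root_score
    for j, k, beta, feature in edges:
        if j in active and k in active:      # v_j * v_k == 1
            total += dot(beta, feature)
    return total
```

An edge contributes only when both of its leaf-nodes are activated, which is why the contextual terms can change the ranking of hypotheses that have similar root scores.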


4. INFERENCE 

The inference task is to detect $T$ optimal temporal anchor frames
for one video instance as well as the spatial composition of action
parts within each anchor frame. In our approach, we perform a
cascaded search that integrates three steps (spatial testing, temporal
testing and global verification) to maximize the global potential
$R^g(X, \mathcal{L})$ defined in Eq. 10.


Step 1. Spatial Testing via the and-nodes.

The subgraph of the STAOG model rooted at one and-node can
be viewed as the spatial composition classifier for localizing action
parts in one frame. We first use all existing leaf-nodes to search for
candidate action parts. Assume the leaf-node $L_j$ associated with
the frame $I_t$ detects the action part at the position $p_j$ by maximizing
the response in Eq. 1. Each or-node is allowed to activate only one
leaf-node, so a possible configuration of action parts
is decided by the indicator variables of the or-nodes (i.e. $v_i$ for
or-node $U_i$, indicating which leaf-node is activated). In this way, a set
of possible configuration hypotheses $\{V_t\}$ is generated for further
testing, which assembles the hypotheses proposed by the or-nodes
for the frame $I_t$. In practice, we limit the maximum number of
hypotheses by setting a threshold on $R^a_t(\Lambda_t, V_t)$ in Eq. 3.


Algorithm 1 Inference Algorithm

Input:
  A learned STAOG model $\mathcal{G}$, and the action parts detected by all
  leaf-nodes by maximizing the response in Eq. 1.

Initialization:
  The set of possible hypotheses $\mathcal{I}_t = \{\}$ for all anchor frames
  $t = 1, \ldots, T$.

Iteration:
  for all $t = 1, \ldots, T$ do
    For each and-node $A_t$, a set of temporal displacement steps $E$
    is predefined for sliding the possible anchor frames.
    for all $\Delta_t \in E$ do
      (a) Initialize the set of pair terms $Q = \{Q_1, \ldots, Q_K\}$
          for all children or-nodes of $A_t$.
      (b) Generate a set of pair terms $Q_i$ for each or-node $U_i$:
          for all $U_i$, $i \in ch(t)$ do
            for all $L_j$, $j \in ch(i)$ do
              $Q_i = Q_i \cup (i, j)$.
            end for
          end for
      (c) Obtain possible hypotheses $V_t = (v_1, \ldots, v_K)$ by
          assembling the indicator variables of the $K$ or-nodes
          according to the set $Q$.
      (d) Construct the set of possible hypotheses for each specific
          displacement $\Delta_t$ as $\mathcal{I}_t = \mathcal{I}_t \cup (\{V_t\}, \Delta_t)$.
    end for
  end for

  Assemble the hypotheses $\{\mathcal{I}_t\}$ of all $T$ anchor frames in order
  to generate the set of hypothesis sequences $\mathcal{I}$. Each possible
  configuration $(V, \Delta)$ belongs to the set $\mathcal{I}$. The global response of
  the STAOG model can be calculated by Eq. 12.

Output:
  The latent variables $V$, $\Delta$ and the final score $S_\psi(X)$.
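
The hypothesis generation in Algorithm 1 amounts to enumerating, for every anchor frame, one leaf-node per or-node combined with each temporal displacement step, and then scoring sequences of per-frame hypotheses. The sketch below does this exhaustively with `itertools.product`; it is an illustration only, since the paper limits the number of hypotheses with a threshold on the and-node response, and `score_fn` stands in for the global response of Eq. 10.

```python
from itertools import product


def frame_hypotheses(children, displacement_steps):
    """All (leaf selection, displacement) pairs for one anchor frame.

    children : one list of candidate leaf ids per or-node of the and-node
    """
    hyps = []
    for delta in displacement_steps:
        for choice in product(*children):    # one leaf per or-node
            hyps.append((choice, delta))
    return hyps


def best_sequence(per_frame_hyps, score_fn):
    """Exhaustively pick the highest-scoring sequence of per-frame hypotheses."""
    best = max(product(*per_frame_hyps), key=score_fn)
    return best, score_fn(best)
```

The number of sequences grows exponentially with the number of anchor frames, which is why pruning at the and-node level matters in practice.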


Step 2. Temporal Testing via the root-node.

We apply the spatial testing with the and-nodes to localize a
number of candidate anchor frames over several frames. The scores
over candidate anchor frames (proposed by the and-nodes) are
aggregated via the root-node for a possible temporal composition. For
efficiency, we utilize a fixed number of discrete steps $E$ for
searching each anchor frame. Several possible hypotheses are then
produced with different anchor frame determinations by sliding over the
discrete steps $E$. In addition, we re-weight the hypotheses at the
root-node in Eq. 5 by considering the temporal displacements of
anchor frames as well as the global features over the video clip.
Intuitively, the hypotheses are represented by the specified latent
variables $(V, \Delta)$.

Step 3. Global Verification.

Given all the hypotheses from the root-node, we apply the global
potential function defined in Eq. 10 to validate the optimal
detection. The objective of the global verification is to cope with the
noisy local detections on leaf-nodes. It combines the score of the
root-node with the responses of spatial and temporal contextual
interactions (edges).

The optimal response $S_\psi(X)$ of the model, as well as the latent
variables $(V, \Delta)$ that attain it, can be calculated as,

$$S_\psi(X) = \max_{(V, \Delta)} R^g(X, V, \Delta). \qquad (12)$$

Algorithm 1 summarizes the overall inference procedure.


5. STRUCTURAL LEARNING 

We formulate the structural learning of the STAOG model as a joint
optimization task over model structure and parameters. We solve
it with a novel latent learning method extended from the
CCCP framework [43]. This algorithm trains the model iteratively
in a dynamic manner: leaf-nodes can be automatically created
or removed to reconfigure the model structure. The model structure
is determined by the latent variables $\mathcal{L} = (V, \Delta)$ that are inferred in
each step.

Let $D = ((X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N))$ be a set of
labeled training samples, with $y_k \in \{1, -1\}$. The feature vector for
each sample $(X, y)$ is defined as,


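$$\Phi(X, y, \mathcal{L}) = \begin{cases} \Phi(X, \mathcal{L}) & \text{if } y = +1, \\ 0 & \text{if } y = -1. \end{cases} \qquad (13)$$

The temporal anchor frames and the spatial configuration of each
frame can be optimized by maximizing $R^g(X, \mathcal{L})$ in the inference
procedure. With $\mathcal{L} = (V, \Delta)$ as the latent variables,
we redefine Eq. 12 as,

$$S_\psi(X) = \max_{y, \mathcal{L}} \left( \psi \cdot \Phi(X, y, \mathcal{L}) \right). \qquad (14)$$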


The optimization of this function can be solved by the latent
structural SVM framework,

$$\min_{\psi} \; \frac{1}{2} \|\psi\|^2 + C \sum_{k=1}^{N} \Big[ \max_{y, \mathcal{L}} \big( \psi \cdot \Phi(X_k, y, \mathcal{L}) + h(y_k, y) \big) - \max_{\mathcal{L}} \big( \psi \cdot \Phi(X_k, y_k, \mathcal{L}) \big) \Big], \qquad (15)$$

where $C$ is a penalty parameter set to 0.003 empirically and $h(y_k, y)$
is the cost function, with $h(y_k, y) = 0$ if $y_k = y$ and 1 otherwise.
The optimization problem described above is not convex in general.
Following the CCCP framework, we convert the function in Eq. 15
into a convex and concave form as,

$$\min_{\psi} \Big[ \frac{1}{2} \|\psi\|^2 + C \sum_{k=1}^{N} \max_{y, \mathcal{L}} \big( \psi \cdot \Phi(X_k, y, \mathcal{L}) + h(y_k, y) \big) \Big] - \Big[ C \sum_{k=1}^{N} \max_{\mathcal{L}} \big( \psi \cdot \Phi(X_k, y_k, \mathcal{L}) \big) \Big] = \min_{\psi} \big[ f(\psi) - g(\psi) \big], \qquad (16)$$

where $f(\psi)$ represents the first two terms, and $g(\psi)$ the last term.
This leads to an iterative learning algorithm that alternates between
estimating the model parameters and the hidden variables $\mathcal{L}$. However, we still
need to dynamically determine the graph configuration, i.e. the
production of leaf-nodes associated with or-nodes. An additional step
for dynamically reconfiguring the structure is therefore added between the two
original steps. The procedure is presented as follows.

(I) The model parameter ipt obtained in the previous iteration is 
fixed. We find a hyperplane q t to upper bound the concave part 
gigfj) in Eq.[l6] Specifically, q t is the derivative of g( VO- Thus, we 
have —gty) < —g(ipt) + (ip — V^) • Qt,Vip- The optimal latent 


Algorithm 2 Learning algorithm for STAOG model 

Input: 

Training samples,!) = ((Xi, 2 /i), (X 2 ,y 2 ), • • •, (X N ,y N )). 

Output: 

The trained STAOG model. 

Initialization: 

1. Initialize the positions of action parts and anchor frames 
for all samples. 

2. Initialize the latent variables $\zeta$ and parameters $\psi$.

repeat

1. Estimate the latent variable $\zeta_k^* = (V_k^*, \Delta_k^*)$ for each
positive sample $(X_k, y_k)$ during inference.

2. (a) Localize the anchor frames and action parts using
the current latent variables $(V^*, \Delta^*)$.

(b) For each or-node $U_i$, obtain the set of feature vectors
of its children leaf-nodes for positive samples, and regroup
the feature vectors by spectral clustering.

(c) Reconfigure leaf-nodes according to the clustering results
to generate a new structure. Calculate the energy in Eq. (11)
as $E(\psi^d)$.

3. - if $E(\psi^d) < E(\psi_t)$ then

Accept the new model structure and estimate the
parameters $\psi_{t+1} = \arg\min_{\psi} (f(\psi) + \psi \cdot q^d)$.

- else

Keep the previous model structure. Estimate the
parameters $\psi_{t+1} = \arg\min_{\psi} (f(\psi) + \psi \cdot q_t)$.

- end if

until the optimization function defined in Eq. (15) converges.


variable $\zeta_k^*$ is calculated by
$\zeta_k^* = \arg\max_{\zeta} (\psi_t \cdot \Phi(X_k, y_k, \zeta))$
for each positive example. Then the hyperplane is constructed as
$q_t = -C \sum_{k=1}^{N} \Phi(X_k, y_k, \zeta_k^*)$.
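Step (I) can be sketched as follows. This is a minimal illustration, assuming that every candidate latent configuration of a sample has already been converted into a feature vector; the helper names `dot`, `best_latent` and `hyperplane`, and the toy data, are hypothetical and not taken from the original implementation.

```python
from typing import List, Sequence

Vector = List[float]

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_latent(psi: Vector, candidates: List[Vector]) -> int:
    """Index of the candidate Phi(X_k, y_k, zeta) with the highest score under psi."""
    return max(range(len(candidates)), key=lambda i: dot(psi, candidates[i]))

def hyperplane(psi: Vector, samples: List[List[Vector]], C: float) -> Vector:
    """q_t = -C * sum_k Phi(X_k, y_k, zeta_k*), with zeta_k* chosen under the fixed psi."""
    q = [0.0] * len(psi)
    for candidates in samples:
        phi_star = candidates[best_latent(psi, candidates)]
        q = [qi - C * pi for qi, pi in zip(q, phi_star)]
    return q

# Toy data (assumed): two positive samples, two candidate configurations each.
psi_t = [1.0, -1.0]
samples = [
    [[1.0, 0.0], [0.0, 1.0]],  # scores 1.0 and -1.0: the first candidate wins
    [[0.5, 0.5], [2.0, 1.0]],  # scores 0.0 and 1.0: the second candidate wins
]
q_t = hyperplane(psi_t, samples, C=0.5)
print(q_t)  # [-1.5, -0.5]
```

In the actual model the maximization over $\zeta$ is carried out by inference on the graph rather than by enumerating candidates; the enumeration here only keeps the sketch short.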

(II) In the second step, the STAOG model is adjusted by structural
reconfiguration, applying the current model to the training
examples. The reconfiguration is performed for each part, i.e.
or-node, independently, with the latent variable
$\zeta_k^* = (V_k^*, \Delta_k^*)$ fixed. Note that each action part
detected by one leaf-node is mapped to several feature bins at
specific positions in the vector $\Phi(X_k, y_k, \zeta_k^*)$.

For each or-node $U_i$, we apply its children leaf-nodes to detect
parts in all positive samples. Assume that leaf-node $L_j$ detects
an action part on the $k$-th sample using the feature vector
$\pi_j^k$, which is a sub-vector of the complete feature vector of the
$k$-th sample, $\Phi(X_k, y_k, \zeta_k^*)$. We thus obtain a set of
feature vectors $\{\pi_j^k\}$ for all samples. The vectors detected by
the same leaf-node are first grouped into one cluster, i.e. one
cluster per leaf-node. We denote the cluster for the $j$-th leaf-node
as $\Omega_j$. Then we run the spectral clustering algorithm with the
Euclidean distance on the vectors of all children leaf-nodes of
or-node $U_i$ for all positive samples, so that similar vectors are
grouped together. We re-arrange the feature vectors for all samples
based on the newly generated partition. For example, if the feature
vector $\pi_j^k$ is regrouped from $\Omega_j$ into another cluster
$\Omega_{j'}$, we adjust the position of $\pi_j^k$ in
$\Phi(X_k, y_k, \zeta_k^*)$ by moving the feature bins into the
position representing the $j'$-th leaf-node. If $\Omega_{j'}$ is a
newly generated cluster, we create a new leaf-node accordingly. By
analogy, we remove a leaf-node if few samples are grouped into the
corresponding cluster. In this way, the structure of $U_i$ is
reconfigured through the feature vector re-arrangement. In practice,
we constrain the extent of structural reconfiguration, i.e. only a
few leaf-nodes can be created or removed in one iteration. We present
a toy example in Fig. 5 for illustration. In Fig. 5(a), a leaf-node
associated with the or-node $U_7$ is created to better handle the
intra-class variance; a leaf-node is removed if there is another
similar one (e.g. the leaf-node associated with the
Figure 5: Illustration of discriminative structural learning. We reconfigure the model structure by re-arranging the feature vector,
as in the illustrated example. (a) Parts of the STAOG model reconfigured in two iterations, where the left one represents
the original model and the right one the new model. During this step, new leaf-nodes associated with $U_7$ and $U_{11}$ are created
and a leaf-node associated with $U_8$ is removed. Assume that we use 5 samples, $X_1, \ldots, X_5$, for the structure learning. (b) The
feature vectors detected by the same leaf-node are first grouped into one cluster, i.e. one cluster per leaf-node. (c) The
feature rearrangement after clustering. For example, the feature vector of sample $X_1$ is regrouped from cluster $\Omega_9$ into cluster $\Omega_{13}$,
so we move the feature bins $\pi_9$ into the bins corresponding to the leaf-node $L_{13}$. Cluster $\Omega_{14}$ is a newly generated cluster, so we create
a new leaf-node accordingly.


or-node $U_8$). The sub-vector $\pi_9$ of sample $X_1$ is regrouped from
cluster $\Omega_9$ to cluster $\Omega_{13}$, so the feature bins are moved
from $\pi_9$ to $\pi_{13}$, as Fig. 5(c) shows.
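The re-arrangement of feature bins can be sketched as follows. This is an illustration under stated assumptions: each leaf-node owns a fixed, equally sized slice of the full feature vector, and the bins of a leaf-node that no longer fires on a sample are set to zero. The function `move_bins` and the bin layout are hypothetical.

```python
from typing import Dict, List, Tuple

def move_bins(phi: List[float], bins: Dict[int, Tuple[int, int]],
              src: int, dst: int) -> List[float]:
    """Move the sub-vector of leaf-node `src` into the bins of leaf-node `dst`.

    `bins[j]` is the (start, end) slice owned by leaf-node j inside the full
    feature vector; both slices are assumed to have the same length.
    """
    s0, s1 = bins[src]
    d0, d1 = bins[dst]
    if s1 - s0 != d1 - d0:
        raise ValueError("source and destination bins must have equal length")
    out = list(phi)                 # leave the input vector untouched
    out[d0:d1] = phi[s0:s1]         # the part features now belong to the new leaf-node
    out[s0:s1] = [0.0] * (s1 - s0)  # assumed: the old leaf-node is inactive for this sample
    return out

# Toy layout (assumed): leaf-node 9 owns bins [0, 3), leaf-node 13 owns bins [3, 6).
bins = {9: (0, 3), 13: (3, 6)}
phi_x1 = [0.2, 0.7, 0.1, 0.0, 0.0, 0.0]
phi_x1_new = move_bins(phi_x1, bins, src=9, dst=13)
print(phi_x1_new)  # [0.0, 0.0, 0.0, 0.2, 0.7, 0.1]
```

Because only positions change, the part descriptor itself is preserved; what changes is which leaf-node's parameters it is matched against.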

After this structure reconfiguration, we obtain the new feature
vector for each sample, $\Phi^d(X_k, y_k, \zeta_k^*)$, and the hyperplane
is recalculated accordingly as
$q^d = -C \sum_{k=1}^{N} \Phi^d(X_k, y_k, \zeta_k^*)$.

(III) The newly generated model structure can be represented by
the feature vector $\Phi^d(X_k, y_k, \zeta_k^*)$. The model parameters
can be learned by solving
$\psi^d = \arg\min_{\psi} (f(\psi) + \psi \cdot q^d)$. The optimization
task in Eq. (15) becomes,

$$
\min_{\psi} \frac{1}{2}\|\psi\|^2 + C \sum_{k=1}^{N} \Big[ \max_{y,\zeta} \big( \psi \cdot \Phi(X_k, y, \zeta) + h(y_k, y) \big)
- \psi \cdot \Phi^d(X_k, y_k, \zeta_k^*) \Big]. \qquad (17)
$$

This is a standard structural SVM problem, which can be solved
with the cutting plane method and Sequential Minimal Optimization.
The energy in Eq. (11) can be calculated as
$E(\psi^d) = f(\psi^d) - g(\psi^d)$. We accept the new model structure
if $E(\psi^d) < E(\psi_t)$ and set $\psi_{t+1} = \psi^d$. Otherwise, we
keep the model structure from the previous iteration and the
parameter vector is calculated as
$\psi_{t+1} = \arg\min_{\psi} (f(\psi) + \psi \cdot q_t)$. Thus, we ensure
that the optimization function decreases in each iteration. We
repeat the three-step iteration until convergence. Algorithm 2
summarizes the overall procedure for learning a STAOG model. In the
case of multi-class classification, we use a one-against-rest
approach and select the class with the highest score.
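The accept-or-reject logic of the three-step iteration can be sketched as a generic loop. This is only a skeleton under assumptions: `propose`, `fit` and `energy` are hypothetical callables standing in for structural reconfiguration, the structural SVM solver and the energy evaluation, and the toy problem at the end is invented purely to exercise the control flow.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")  # model structure
P = TypeVar("P")  # parameter vector

def learn(structure: S, psi: P,
          propose: Callable[[S, P], S],
          fit: Callable[[S, P], P],
          energy: Callable[[S, P], float],
          max_iter: int = 10, tol: float = 1e-6) -> Tuple[S, P]:
    """Alternate structure proposals and parameter fitting; never accept a worse structure."""
    current = energy(structure, psi)
    for _ in range(max_iter):
        new_structure = propose(structure, psi)   # step (II): reconfigure the structure
        new_psi = fit(new_structure, psi)         # step (III): refit on the new structure
        new_energy = energy(new_structure, new_psi)
        if new_energy < current:                  # accept the reconfiguration
            structure, psi, candidate = new_structure, new_psi, new_energy
        else:                                     # keep the previous structure, refit only
            psi = fit(structure, psi)
            candidate = energy(structure, psi)
        if current - candidate < tol:             # energy no longer decreases noticeably
            break
        current = candidate
    return structure, psi

# Toy problem (assumed): the structure is a leaf-node count with optimum 3,
# the parameter is a scalar with optimum 1.0.
structure, psi = learn(
    structure=1, psi=0.0,
    propose=lambda s, p: s + 1,
    fit=lambda s, p: p + 0.5 * (1.0 - p),
    energy=lambda s, p: (s - 3) ** 2 + (p - 1.0) ** 2,
)
print(structure, psi)
```

In the toy run the proposals to grow beyond three leaf-nodes are rejected because they raise the energy, while the parameter keeps improving under the retained structure.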

6. EXPERIMENTS 

We test our STAOG model on two action recognition databases: UCF
YouTube [21] and Olympic Sports [24]. The video resolution is
normalized to 320 x 240. The YouTube dataset contains 11 action
categories and is challenging due to large variation in camera
motion, object appearance, pose/view and large intra-class
variability. We follow the standard setup using leave-one-out cross
validation for a pre-defined set of 25 folds. Average accuracy over
all classes is reported as the performance measure. The Olympic
Sports dataset consists of 16 different sports classes that contain
complex motions going beyond simple punctual or repetitive actions.
Its challenges arise from background clutter, viewpoint changes and
complex sequences of primitive actions. Each action is performed by a
single actor and represents a temporal sequence of primitive actions
(e.g. triple-jumping, pole-vault and diving). We use the same
train-test split and report the average precision (AP) for each of
the action classes as in [35].
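The two performance measures can be sketched as follows. This is a minimal illustration, assuming non-interpolated average precision; the exact AP protocol used in [35] may differ, and the function names and toy inputs are hypothetical.

```python
from typing import Dict, List, Sequence

def mean_class_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Average of per-class accuracies, so every class counts equally."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for t, p in zip(y_true, y_pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP: mean of the precision at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy inputs (assumed).
acc = mean_class_accuracy(["dive", "dive", "golf", "golf"],
                          ["dive", "golf", "golf", "golf"])
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(acc)  # 0.75
print(ap)   # (1/1 + 2/3) / 2
```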

6.1 Implementation 

We fix the number of and-nodes (anchor frames) in the STAOG model
empirically as $T = 3$ for the UCF YouTube dataset and $T = 5$ for the
Olympic Sports dataset. We need more anchor frames for the Olympic
Sports dataset because its actions are more complex and last over
more frames. In general, the parameter $T$ can be roughly estimated
from the temporal complexity of the action. The spatial layout of
each anchor frame is fixed as 2 x 2, giving $K = 4$ or-nodes per
anchor frame. There are at most $m = 4$ leaf-nodes associated with
each or-node. We extract interest points described by HOG and HOF
features beforehand, using the code published in CCD. The size of
one action part (within a 3-D volume) is empirically set to 60 x 60
pixels spanning $p = 15$ frames. The dimension of the generated
dictionary is set to 300 for describing action parts in each anchor
frame and the global features of the anchor frames. We set the
discrete temporal steps for searching anchor frames as
$\xi = [\pm 2, \pm 4, \pm 6, \pm 8, \pm 10]$. Our learning algorithm
usually converges in 9 to 10 iterations.
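The settings above imply the following model sizes. The sketch below only does the bookkeeping; the class `StaogConfig` and its field names are hypothetical, and the leaf-node count is an upper bound, since the actual number is decided during structural learning.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class StaogConfig:
    anchor_frames: int            # T, the number of and-nodes
    or_nodes_per_frame: int = 4   # K, from the 2 x 2 spatial layout
    max_leaves_per_or: int = 4    # m
    max_temporal_step: int = 10   # largest displacement, in frames
    step_stride: int = 2

    def num_or_nodes(self) -> int:
        return self.anchor_frames * self.or_nodes_per_frame

    def max_leaf_nodes(self) -> int:
        return self.num_or_nodes() * self.max_leaves_per_or

    def temporal_steps(self) -> List[int]:
        """Candidate displacements, [-10, ..., -2, 2, ..., 10] by default."""
        pos = list(range(self.step_stride, self.max_temporal_step + 1, self.step_stride))
        return [-s for s in reversed(pos)] + pos

youtube = StaogConfig(anchor_frames=3)
olympic = StaogConfig(anchor_frames=5)
print(youtube.num_or_nodes(), youtube.max_leaf_nodes())  # 12 48
print(olympic.num_or_nodes(), olympic.max_leaf_nodes())  # 20 80
print(youtube.temporal_steps())
```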

The experiments are carried out on a PC with a Core i5 3.0 GHz
CPU and 4 GB memory. The average CPU time to process a video is 200
seconds for the Olympic Sports dataset and 150 seconds for the UCF
YouTube dataset. In particular, processing one frame takes 7 seconds
on average in the Olympic Sports dataset and 4 seconds in the UCF
YouTube dataset. The efficiency of our method varies slightly with
the density of the feature points in the video sequence.
Class    | [21]  | [44]  | 03    | [36]  | Ours-1 | Ours-2 | Ours (full)
---------|-------|-------|-------|-------|--------|--------|------------
b_shoot  | 53.0% | 98.0% | 48.5% | 43.0% | 58.4%  | 62.0%  | 77.9%
bike     | 73.0% | 74.0% | 75.2% | 91.7% | 82.1%  | 87.3%  | 88.6%
dive     | 81.0% | 80.0% | 95.0% | 99.0% | 98.4%  | 98.6%  | 98.8%
golf     | 86.0% | 68.0% | 95.0% | 97.0% | 95.7%  | 95.3%  | 97.4%
h_ride   | 72.0% | 65.0% | 73.0% | 85.0% | 81.3%  | 86.0%  | 88.0%
s_juggle | 54.0% | 67.0% | 53.0% | 76.0% | 66.0%  | 81.6%  | 82.2%
swing    | 57.0% | 71.0% | 66.0% | 88.0% | 85.2%  | 84.1%  | 85.4%
t_swing  | 80.0% | 68.0% | 77.0% | 71.0% | 69.6%  | 80.0%  | 80.7%
t_jump   | 79.0% | 80.0% | 93.0% | 94.0% | 90.2%  | 94.7%  | 95.8%
v_spike  | 73.3% | 77.0% | 85.0% | 95.0% | 89.7%  | 90.6%  | 96.4%
walk     | 75.0% | 54.0% | 66.7% | 87.0% | 85.2%  | 86.2%  | 87.4%
Accuracy | 71.2% | 72.9% | 75.2% | 84.2% | 82.0%  | 86.0%  | 88.9%

Table 1: Accuracy per action class and average accuracy for all classes on the YouTube dataset [21].
Figure 6: Example inference results for three action models, basketball-layup (a), clean-and-jerk (b) and triple-jump (c),
learned on the Olympic Sports dataset; each action category includes two instances. The red boxes in each frame represent the
discovered discriminative action parts. Our model successfully localizes the anchor frames across the instances in the long
action videos. In addition, the large intra-class variability is captured by our model.
Class      | [24]  | [35]  | Ours (full)
-----------|-------|-------|------------
h-jump     | 27.0% | 18.4% | 35.6%
l-jump     | 71.7% | 81.8% | 86.4%
t-jump     | 10.1% | 16.1% | 36.2%
p-vault    | 90.8% | 84.9% | 84.3%
g-vault    | 86.1% | 85.7% | 83.1%
s-put      | 37.3% | 43.3% | 56.8%
snatch     | 54.2% | 88.6% | 89.0%
c-jerk     | 70.6% | 78.2% | 83.3%
j-throw    | 85.0% | 79.5% | 78.1%
h-throw    | 71.2% | 70.5% | 75.4%
d-throw    | 47.3% | 48.9% | 53.3%
d-platform | 95.4% | 93.7% | 92.8%
d-board    | 84.3% | 79.3% | 76.5%
basketball | 82.1% | 85.5% | 86.7%
bowling    | 53.0% | 64.3% | 62.0%
t-serve    | 33.4% | 49.6% | 62.3%
mAP        | 62.5% | 66.8% | 71.4%

Table 2: Average Precision (AP) values on the Olympic Sports
dataset [24].


6.2 Results and Comparisons 

Compared with recently proposed methods on the YouTube dataset,
our model outperforms the state of the art: we achieve an accuracy
of 88.9%, while the reported results of the competing algorithms are
71.2% in [21], 72.9% in [44], 75.2% in 03 and 84.2% in [36]. The
accuracy scores for all categories are reported in Table 1. Our
method performs best on 7 of the 11 categories, which have
relatively large intra-class variance or background disturbance. On
the Olympic Sports dataset, we obtain better AP scores for 10 of the
16 categories and an overall AP score of 71.4%, better than the
previous methods [35, 24]. The competing method proposed by Tang et
al. [35] uses a variable-duration HMM for learning the temporal
structure in the video. Our results show that our compositional
model with explicit spatial and temporal relations can achieve
better performance. The detailed results are reported in Table 2.
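The Olympic Sports summary figures can be re-derived from the per-class values of Table 2. The sketch below is only a consistency check on the reported numbers; the tuple layout and variable names are assumptions, with the two competing methods kept anonymous as the first and second value of each row.

```python
# (class, first competing method, second competing method, ours), AP in percent
TABLE2 = [
    ("h-jump", 27.0, 18.4, 35.6), ("l-jump", 71.7, 81.8, 86.4),
    ("t-jump", 10.1, 16.1, 36.2), ("p-vault", 90.8, 84.9, 84.3),
    ("g-vault", 86.1, 85.7, 83.1), ("s-put", 37.3, 43.3, 56.8),
    ("snatch", 54.2, 88.6, 89.0), ("c-jerk", 70.6, 78.2, 83.3),
    ("j-throw", 85.0, 79.5, 78.1), ("h-throw", 71.2, 70.5, 75.4),
    ("d-throw", 47.3, 48.9, 53.3), ("d-platform", 95.4, 93.7, 92.8),
    ("d-board", 84.3, 79.3, 76.5), ("basketball", 82.1, 85.5, 86.7),
    ("bowling", 53.0, 64.3, 62.0), ("t-serve", 33.4, 49.6, 62.3),
]

# Classes where our AP is strictly higher than both competing methods.
wins = [name for name, a, b, ours in TABLE2 if ours > max(a, b)]
# Mean AP of our method over all classes.
mean_ap = sum(row[3] for row in TABLE2) / len(TABLE2)

print(len(wins), len(TABLE2))  # 10 16
print(round(mean_ap, 1))       # 71.4
```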

Figure 6 illustrates the inference results on three action
categories from the Olympic Sports dataset, each of which includes
two instances. Our model also localizes the action parts in the
anchor frames, as they are discriminative in appearance and motion.
These results demonstrate the capability of our model, since the
action scenarios contain realistic challenges in video action
recognition. The spatial compositions defined over the and-nodes
enable us to handle pose/view variations and background
disturbances. The temporal compositions are effective in localizing
anchor frames in videos with varying motion frequency, temporal
location and video length.

For further evaluation, we conduct three empirical analyses in
different model settings as follows.

(I) We simplify the temporal compositions by discarding the
displacement $\Delta t$ for each anchor frame, i.e. fixing the temporal
structure. We report the accuracies of this model setting in the
fifth column of Table 1, named "Ours-1". The average accuracy is
82.0%, 6.9% less than our complete model.

(II) We also evaluate the benefit of spatio-temporal contextual
interactions. Our model can be simplified into a tree structure
by removing the interactions. The accuracies are shown in the
sixth column of Table 1, named "Ours-2". We observe that the
spatio-temporal contextual interactions increase the accuracy by
2.9% on average. This gain also speaks in favour of our model, as it
shows that better associations between the anchor frames and action
parts lead to better accuracy.

Figure 7: Empirical analysis for different settings of spatial
compositions, where we set different maximum numbers $m$ of
leaf-nodes under the or-nodes. Each bar represents the accuracy
for one action category. The color indicates the setting,
$m = 2$ or $m = 4$.

(III) One may be interested in how the performance is improved by
introducing the or-nodes in spatial compositions, which are one of
the key components of the STAOG model. In this experiment, we set
different maximum numbers $m$ of leaf-nodes under the or-nodes,
i.e. how many leaf-nodes at most can be created for the model. We
compare the results on the YouTube dataset in Fig. 7 and observe
that $m = 4$ generally achieves better results than $m = 2$ (82.4%).
In practice, the model is not sensitive to $m$, as the exact number
of leaf-nodes is decided by clustering on the data during structural
learning.
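The gaps quoted in analyses (I) and (II) follow directly from the average accuracies in the last row of Table 1. The short check below uses only those reported values; the dictionary and variable names are hypothetical.

```python
# Average accuracies (percent) from the last row of Table 1.
avg_accuracy = {"Ours-1": 82.0, "Ours-2": 86.0, "Ours (full)": 88.9}

full = avg_accuracy["Ours (full)"]
# Loss from fixing the temporal structure, analysis (I).
gap_fixed_temporal = round(full - avg_accuracy["Ours-1"], 1)
# Loss from removing the contextual interactions, analysis (II).
gap_no_interactions = round(full - avg_accuracy["Ours-2"], 1)

print(gap_fixed_temporal)   # 6.9
print(gap_no_interactions)  # 2.9
```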

7. CONCLUSION 

This paper studies a novel hierarchical model for human action
recognition, in the form of a configurable Spatio-Temporal And-Or
Graph. This model is shown to handle realistic challenges in action
recognition well. We consider two directions to improve our method.
First, the model can be integrated with high-level semantic
information to represent multi-agent complex events. Second, we plan
to speed up the algorithm for large-scale processing.